Reading data

We start by reading in the data. The programs reads in the output h5 file filtered_feature_bc_matrix.h5 of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column). To improve the quality of data, we include features detected in at least 3 cells and include cells where at least 200 features are detected. Other features and cells will be dropped. Finally, we get 18746 features and 7530 cells.

## An object of class Seurat 
## 18746 features across 7530 samples within 1 assay 
## Active assay: RNA (18746 features, 0 variable features)

QC and selecting cells for further analysis

Before analysing the single-cell gene expression data, we must ensure that all cellular barcode data correspond to viable cells. Cell quality control is performed based on four QC covariates:

  1. The number of unique genes detected in each cell (nFeature_RNA). Low-quality cells or empty droplets will often have very few genes. Cell doublets or multiplets may exhibit an aberrantly high gene count.

  2. The total number of genes detected within a cell (correlates strongly with unique genes) (nCount_RNA)

  3. The percentage of reads that map to the mitochondrial genome (percent.mito). Low-quality / dying cells often exhibit extensive mitochondrial contamination. We use the set of all genes starting with MT- as a set of mitochondrial genes

  4. The percentage of reads that map to the ribosomal genome (percent.ribo).

The violin plot for those four QC metrics was shown below:

Then we filter cells that have unique feature counts over 7500, genes detected within a cell more than 70000. We filter cells that have >15% mitochondrial counts and >20% ribosomal counts.

Note: we observed the high percentage of the mitochondrial counts, which may indicate the potential issues in sequencing, such as the poor sample quality, leading to a high fraction of apoptotic or lysing cells.

After filtering, the number of cells and violin plot is as follows:

sample before_qc after_qc
Total 7530 2991
stomach 7530 2991

Normalizing the data

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.

Identification of highly variable features (feature selection)

We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets. The figure below labled the top-10 highly variable features.

Scaling the data

Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. This step:

  1. Shifts the expression of each gene, so that the mean expression across cells is 0
  2. Scales the expression of each gene, so that the variance across cells is 1 This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate

Clustering the cells

Here we apply a graph-based clustering approach, building upon initial strategies in (Macosko et al). Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.

We first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). To cluster the cells, we next apply modularity optimization techniques(Louvain algorithm) to iteratively group cells together, with the goal of optimizing the standard modularity function.

After clustering, we get 9 clusters. Then we use UMAP to visualize and explore these clusters The goal of UMAP is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots.

The cell number of each cluster listed below:

## 
##   0   1   2   3   4   5   6   7 
## 882 704 585 460 209  67  46  38

Visualizing marker expression

Here we use violin plot to show expression probability distributions across clusters. Note that the marker ‘Car1’ was not found in the features.

The following violin plots and feature plots of three groups genes from ‘Basu_Expectation.pdf’.

Group 1

Group 2

Group 3

Appended

The plots for Kdr gene were required in the email on December 28.

The plots for Drd1, Drd2, Drd3, Drd4 and Drd5 genes were required in the email on November 4.

The dopamine and dopamine receptors (D1, D2, D3, D4 and D5) expressed in few cells. In stomach, the Drd1 expressed in 2 cells, Drd2 expressed in 2 cells, Drd3 expressed in 1 cell, Drd4 and Drd5 have no expression. In colon, the Drd1 expressed in 1 cell, Drd2 expressed in 0 cell, Drd3 expressed in 1 cell, Drd4 expressed in 54 cells, and Drd5 have no expression. In the QC part, they were filtered. We can’t get the violin plots for them.

The plots for Th gene were required in the email on October 24.

The markers for annotation

  1. Epithelial cluster : EPCAM and panCK (KRT4, KRT6, KRT7, KRT8, KRT10, KRT17, KRT18, KRT19 and KRT20)

  1. Blood Vessel Endothelial cells(BEC): PECAM-1/CD31, Tie2/TEK and Kdr/FLK-1

  1. Lymphatic Endothelial cells(LEC) : LYVE-1, PROX1 and Flt4

Subclustering

In stomach Annotations:

Epithelial Cluster: Cluster 0

BEC Cluster: Cluster 2

LEC Cluster : Cluster 5

UMAP plots for those three clusters

Subclustering for those three clusters

Epithelial subclustering

Epithelial Cluster: EPCAM and panCK (KRT4, KRT6, KRT7, KRT8, KRT10, KRT17, KRT18, KRT19 and KRT20)

BEC subclustering

BEC Cluster: PECAM-1/CD31, Tie2/TEK and Kdr/FLK-1

LEC subclustering

LEC Cluster : LYVE-1, PROX1 and Flt4

To show significant results, we don’t find subclusters for cluster LEC because of the low number of cells.

Dot plots